Exploring White Wine Dataset with R by Mohamed Firas M.

White Wine Quality: the data was downloaded from the following site: White Wine Quality, as project 6 of the Data Analyst for Enterprise Nanodegree Program.

Univariate Plots Section

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

Explore the data

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

There are 4898 observations and 13 variables, but the X variable is only a counter for the observations, so it is better to drop it.

Since quality is a rating rather than a continuous measurement, we should convert it to an ordinal categorical variable.

Factorize quality:

##  Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
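The two clean-up steps above could look roughly like the sketch below. The original code chunk is not shown in this report, so the toy data frame is a stand-in for the real file; the names df and quality.cat are taken from the output shown elsewhere in this document.

```r
# Toy stand-in for the wine data (assumption: the real file is not shown here).
df <- data.frame(
  X       = 1:6,
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4),
  quality = c(6L, 6L, 5L, 3L, 7L, 9L)
)

# X only numbers the rows, so remove it.
df$X <- NULL

# Keep the numeric score and add an ordered categorical copy of it.
df$quality.cat <- factor(df$quality, ordered = TRUE)

str(df$quality.cat)
```

Keeping the numeric quality column next to quality.cat makes it possible to compute correlations later, while the ordered factor is convenient for grouping and coloring.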

Print a statistical summary of the data

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##     quality      quality.cat
##  Min.   :3.000   3:  20     
##  1st Qu.:5.000   4: 163     
##  Median :6.000   5:1457     
##  Mean   :5.878   6:2198     
##  3rd Qu.:6.000   7: 880     
##  Max.   :9.000   8: 175     
##                  9:   5

Univariate Analysis

Frequency of the quality of the wine

The measure of quality ranges from 3 to 9. Let's look at how the observations' quality scores are distributed.

Frequency of the quality of the wine as a table

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

The distribution of wine quality looks roughly normal. The lowest quality is 3 and the highest is 9; there are no ratings of 1, 2 or 10. It is better to bucket the quality into 3 classes (Low, Average, High).
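A minimal sketch of the frequency table and the bucketing, using a short made-up vector of scores. The cut points are an assumption (3-4 = Low, 5-6 = Average, 7-9 = High), since the report does not state which ones were used.

```r
# Made-up quality scores, only to illustrate the steps.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Frequency of each score.
table(quality)

# Assumed cut points: (2,4] = Low, (4,6] = Average, (6,9] = High.
quality.bucket <- cut(quality,
                      breaks = c(2, 4, 6, 9),
                      labels = c("Low", "Average", "High"),
                      ordered_result = TRUE)

table(quality.bucket)
```

With scores this unevenly spread, the Low and High buckets stay much smaller than Average whichever reasonable cut points are chosen.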

The first thing to look at for alcoholic beverages is the alcohol percentage.

alcohol

An appropriate level of alcohol enhances the flavor, but too much of it could lower the quality. The median is 10.4% and the majority of values fall between 9% and 13%.

pH

pH has a small range, from 2.72 to 3.82, which means every wine in the dataset is acidic.

Fixed acidity

The basic histogram shows that fixed acidity has very few values at the low end (the minimum is 3.8) and a long tail after 10, so I limited the x-axis range. Changing the binwidth also shows more clearly that the majority of the fixed acidity values fall between 5.5 and 8.5.
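The idea of zooming in on the bulk of the data with narrower bins can be sketched in base R as follows. The values are synthetic (drawn to look loosely like the summary statistics), and the bin width of 0.2 and the x range of 4 to 10 are assumptions, not the settings used for the original plot.

```r
set.seed(1)
# Synthetic values, not the real data: a bell-shaped bulk plus two large outliers.
fixed.acidity <- c(rnorm(500, mean = 6.85, sd = 0.8), 11.8, 14.2)

# Breaks must cover the whole range of the data; xlim then zooms in on the bulk.
breaks <- seq(floor(min(fixed.acidity)), ceiling(max(fixed.acidity)), by = 0.2)
h <- hist(fixed.acidity, breaks = breaks, xlim = c(4, 10),
          main = "Fixed acidity", xlab = "fixed acidity")

# Share of values between 5.5 and 8.5.
mean(fixed.acidity >= 5.5 & fixed.acidity <= 8.5)
```

Restricting xlim only changes what is drawn; the outliers are still in the data and still counted in h$counts.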

The distribution of fixed acidity is very close to a normal distribution, but there are some outliers in the data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

volatile acidity

volatile acidity (acetic acid - g / dm3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

Adjusting the histogram gives a clearer picture:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Citric.acid

The citric acid distribution looks quite normal, with a median of 0.32 and a mean of 0.3342. There is also a peak at around 0.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Residual Sugar

Residual sugar has a wide range, from 0.6 to 65.8 g/l, while the median is only 5.2 g/l. This is probably because wine producers try to cater to consumers' varying preferences for sweetness. Some people like me favor sweet wines, while others might prefer bone dry.

After adjusting the histogram

chlorides

Most wines have an amount of sodium chloride between 0.025 and 0.06 g/l, with a median of 0.043 g/l. The highest level in this dataset is 0.346 g/l.

Looks like there are some outliers in this distribution.

Sulphate

The sulphates distribution seems approximately normal, with a median of 0.4700 and a mean of 0.4898. In the second plot I again dropped the top and bottom 1% of sulphate values. The plots seem slightly bimodal or trimodal, but I am not sure: the peaks seem too close together to classify the distribution as bimodal, although the small scale of this data may matter here.

The data seem to be a bit right-skewed, with very few outliers. Let's narrow the bin sizes.

I transformed the scale to log10 to better visualize the distribution.
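A small sketch of the log10 step, on synthetic right-skewed values rather than the real sulphates column. It compares the gap between mean and median before and after the transformation, which is one simple way to see whether the skew has been reduced.

```r
set.seed(2)
# Synthetic right-skewed values (assumption: log-normal shape, median near 0.47).
sulphates <- rlnorm(1000, meanlog = log(0.47), sdlog = 0.22)

# Work on the log10 scale.
log.sulphates <- log10(sulphates)
hist(log.sulphates, breaks = 30,
     main = "sulphates (log10 scale)", xlab = "log10(sulphates)")

# Mean minus median on the raw scale and on the log10 scale.
gap.raw <- mean(sulphates) - median(sulphates)
gap.log <- mean(log.sulphates) - median(log.sulphates)
c(raw = gap.raw, log10 = gap.log)
```

For right-skewed data the mean usually sits above the median, and the log10 scale tends to pull the two closer together.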

Summary of sulphates:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

density

Most of the density values are between 0.99 and 1.00 g / cm3, but there are some outliers near 1.01 and 1.04. The distribution has a mean of 0.9940 and a median of 0.9937, and a longer right tail than left tail, which can be seen more clearly in the first plot.

summary of Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Free.sulfur.dioxide

There are many outliers, as in most other features. We should trim the outliers to make a better analysis. First, let's adjust the binwidth to obtain deeper insight.

If we also look at the summary statistics:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

There are extremely large values, similar to other variables. If the top 1 percentile is omitted:

This time the distribution looks much better and is close to normal. Only a very small amount of the data has extreme values, and the skewness is very low.
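Omitting the top 1 percentile can be done with quantile(), roughly as below. The vector is synthetic, built with a handful of extreme values on purpose, so the numbers printed will not match the summary above.

```r
set.seed(5)
# Synthetic stand-in with ten extreme values; not the real measurements.
free.sulfur.dioxide <- c(rnorm(990, mean = 35, sd = 15), runif(10, 150, 289))
free.sulfur.dioxide <- pmax(free.sulfur.dioxide, 2)  # keep values at or above 2

# 99th percentile, then keep everything at or below it.
cutoff  <- quantile(free.sulfur.dioxide, probs = 0.99)
trimmed <- free.sulfur.dioxide[free.sulfur.dioxide <= cutoff]

summary(free.sulfur.dioxide)
summary(trimmed)
```

Comparing the two summaries shows how much a few extreme values move the maximum and the mean while leaving the median almost unchanged.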

Total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The data follows a pattern similar to previous variables. There are some extremely large values and a few outliers, but most of the data has a bell-shaped, normal-like distribution. If the top 1 percentile is omitted, the distribution below is obtained.

Most of the data values are between 50 and 240.

The distributions of free sulfur dioxide and total sulfur dioxide both look like normal distributions. The free sulfur dioxide distribution has a mean of 35.31 and a median of 34.00; the total sulfur dioxide distribution has a mean of 138.4 and a median of 134.0. In both cases the mean is larger than the median, and this difference is more noticeable in the distribution of total sulfur dioxide. In the total sulfur dioxide distribution, the spread appears larger for values above the mean than for values below it, so the free sulfur dioxide graph looks more normal.

What is the structure of your dataset?

4898 observations of wine with 12 variables

What is/are the main feature(s) of interest in your dataset?

The main features in the data are quality, alcohol, residual.sugar, density.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Sulfur dioxide, citric acid, chlorides.

Did you create any new variables from existing variables in the dataset?

I created a new quality variable, quality.bucket, with 3 groups (Low, Average, High).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Citric acid has two unusual peaks which stand out from an otherwise normal distribution.

I did a log transformation on the sulphates distribution because it was skewed, and the transformation allowed a better visualization of the data.

Bivariate Plots Section

First, a correlation matrix to see the relations between variables:

## 
## CORRELATIONS
## ============
## - correlation type:  pearson 
## - correlations shown only when both variables are numeric
## 
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                    .           -0.023       0.289
## volatile.acidity            -0.023                .      -0.149
## citric.acid                  0.289           -0.149           .
## residual.sugar               0.089            0.064       0.094
## chlorides                    0.023            0.071       0.114
## free.sulfur.dioxide         -0.049           -0.097       0.094
## total.sulfur.dioxide         0.091            0.089       0.121
## density                      0.265            0.027       0.150
## pH                          -0.426           -0.032      -0.164
## sulphates                   -0.017           -0.036       0.062
## alcohol                     -0.121            0.068      -0.076
## quality                     -0.114           -0.195      -0.009
## quality.cat                      .                .           .
## quality.bucket                   .                .           .
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.089     0.023              -0.049
## volatile.acidity              0.064     0.071              -0.097
## citric.acid                   0.094     0.114               0.094
## residual.sugar                    .     0.089               0.299
## chlorides                     0.089         .               0.101
## free.sulfur.dioxide           0.299     0.101                   .
## total.sulfur.dioxide          0.401     0.199               0.616
## density                       0.839     0.257               0.294
## pH                           -0.194    -0.090              -0.001
## sulphates                    -0.027     0.017               0.059
## alcohol                      -0.451    -0.360              -0.250
## quality                      -0.098    -0.210               0.008
## quality.cat                       .         .                   .
## quality.bucket                    .         .                   .
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                       0.091   0.265 -0.426    -0.017  -0.121
## volatile.acidity                    0.089   0.027 -0.032    -0.036   0.068
## citric.acid                         0.121   0.150 -0.164     0.062  -0.076
## residual.sugar                      0.401   0.839 -0.194    -0.027  -0.451
## chlorides                           0.199   0.257 -0.090     0.017  -0.360
## free.sulfur.dioxide                 0.616   0.294 -0.001     0.059  -0.250
## total.sulfur.dioxide                    .   0.530  0.002     0.135  -0.449
## density                             0.530       . -0.094     0.074  -0.780
## pH                                  0.002  -0.094      .     0.156   0.121
## sulphates                           0.135   0.074  0.156         .  -0.017
## alcohol                            -0.449  -0.780  0.121    -0.017       .
## quality                            -0.175  -0.307  0.099     0.054   0.436
## quality.cat                             .       .      .         .       .
## quality.bucket                          .       .      .         .       .
##                      quality quality.cat quality.bucket
## fixed.acidity         -0.114           .              .
## volatile.acidity      -0.195           .              .
## citric.acid           -0.009           .              .
## residual.sugar        -0.098           .              .
## chlorides             -0.210           .              .
## free.sulfur.dioxide    0.008           .              .
## total.sulfur.dioxide  -0.175           .              .
## density               -0.307           .              .
## pH                     0.099           .              .
## sulphates              0.054           .              .
## alcohol                0.436           .              .
## quality                    .           .              .
## quality.cat                .           .              .
## quality.bucket             .           .              .
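A Pearson correlation matrix restricted to numeric columns, like the one above, can be sketched with base R's cor(). This is not necessarily the function that produced the output above; the data frame is synthetic and the coefficients used to build density are purely illustrative.

```r
set.seed(3)
n <- 200
alcohol        <- runif(n, 8, 14)
residual.sugar <- runif(n, 0.6, 20)
# Synthetic density: falls with alcohol, rises with sugar (illustrative coefficients only).
density <- 1 - 0.001 * alcohol + 0.0004 * residual.sugar + rnorm(n, sd = 0.0005)

df <- data.frame(alcohol, residual.sugar, density,
                 quality.cat = factor(sample(3:9, n, replace = TRUE), ordered = TRUE))

# Factors are not numeric, so they are left out, as in the matrix above.
num <- df[sapply(df, is.numeric)]
round(cor(num, method = "pearson"), 3)
```

Filtering with is.numeric is what keeps quality.cat out of the calculation; it plays the same role as the dots shown for quality.cat and quality.bucket in the printed matrix.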

There is a fairly strong correlation between free sulfur dioxide and total sulfur dioxide (0.616).

It also shows interesting relations between :

1 - residual.sugar vs density

2 - and of course I am going to analyze the main feature, quality, against some interesting variables

Density VS residual sugar

Alcohol vs density :

There is a strong negative correlation between density and alcohol (-0.780): when the percent of alcohol increases, the density decreases.

Alcohol VS Total.sulfur.dioxide:

There is a moderate negative relationship between alcohol and total sulfur dioxide (-0.449).

Free.sulfur.dioxide vs Total.sulfur.dioxide :

We observe a moderate positive correlation between total sulfur dioxide and free sulfur dioxide.

PH VS Fixed.acidity :

This correlation (-0.426) makes logical sense: as fixed acidity increases, the pH value decreases, meaning the wine becomes more acidic.

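Density, Residual.sugar :

There is a strong positive correlation between density and residual sugar (0.839).

Density, Total.sulfur.dioxide :

There is a moderate positive correlation between density and total sulfur dioxide (0.530).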

Quality VS Alcohol :

There is a moderate positive correlation between alcohol and quality. We can see that as the alcohol increases the rating slightly increases as well.

Fitting a linear model to the alcohol and quality scatterplot:

## 
## Call:
## lm(formula = I(alcohol) ~ I(quality), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2986 -0.7882 -0.1382  0.8014  4.1223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.95670    0.10626   65.47   <2e-16 ***
## I(quality)   0.60524    0.01788   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.108 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16
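Note that the call above has alcohol as the response and quality as the predictor. For a simple regression with one predictor, R-squared equals the squared Pearson correlation, which is why the reported 0.1897 is in line with 0.436^2 (about 0.190, the small gap coming from rounding the correlation). The sketch below checks that identity on synthetic data; the coefficients only echo the output above and are not the real data.

```r
set.seed(4)
# Synthetic data loosely echoing the fitted line above (assumption, not the real data).
quality <- sample(3:9, 300, replace = TRUE)
alcohol <- 7 + 0.6 * quality + rnorm(300, sd = 1.1)

fit <- lm(alcohol ~ quality)
r   <- cor(alcohol, quality)

# R-squared from the model and from the correlation should agree.
c(r.squared = summary(fit)$r.squared, r.squared.from.cor = r^2)

# The same arithmetic with the correlation reported in the matrix.
0.436^2
```

An R-squared below 0.2 means that quality alone accounts for less than a fifth of the variation in alcohol, so the relationship is real but far from tight.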

Quality VS Sulphates:

There is no meaningful linear relationship between sulphates and quality (correlation 0.054). This wine additive appears to have little impact on quality.

Quality VS PH

There is no clear relationship between pH and quality.

Quality VS Density

There is a small negative relationship between quality and density (-0.307): as the density increases, the quality tends to decrease.

Quality VS Total.sulfur.dioxide :

There is a small negative relationship between total sulfur dioxide and quality (-0.175): as the total sulfur dioxide increases, the quality tends to decrease.

Quality VS Free.sulfur.dioxide :

No clear relation between free sulfur dioxide and quality.

Quality VS Chlorides :

There is a small negative correlation between chlorides and quality (-0.210): as the amount of salt increases, the quality tends to decrease.

Quality vs Residual.sugar :

There is no meaningful relationship between residual sugar and quality.

Quality VS Citric.acid :

There is no meaningful relationship between citric acid and quality.

Quality VS Volatile.acidity:

There is a small negative correlation between volatile acidity and quality (-0.195).

Quality vs Fixed Acidity :

There is no clear correlation between fixed acidity and quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

I analyzed the relationships between the variables in this dataset. The quality variable, which is my main variable, has its two largest correlations with alcohol (0.436) and density (-0.307). Looking in more detail at quality.cat and the alcohol level, I found that as quality increases from 3 to 5, the alcohol level tends downwards, and as quality increases from 5 to 9, the alcohol level tends upwards.

Did you observe any interesting relationships between the other features

Density is strongly correlated with residual sugar and alcohol.

There is also a relationship between fixed acidity and pH.

What was the strongest relationship you found?

The strongest correlations I found are between other features. There is a strong positive correlation between residual sugar and density: as the amount of sugar increases, the density increases. Another strong relationship was observed between density and alcohol: as the percent of alcohol increases, the density decreases.

Multivariate Plots Section

In this plot we can see that the average alcohol percent is higher for the wines with higher quality rating.
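The averages behind such a plot can be computed with aggregate(). The sketch uses a tiny made-up data frame, so the values are for illustration only and do not come from the wine data.

```r
# Made-up values, only to show the grouping step.
df <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7, 8, 8),
  alcohol = c(9.4, 9.8, 10.2, 10.8, 11.2, 11.6, 11.9, 12.3)
)

# Mean alcohol for each quality score.
avg <- aggregate(alcohol ~ quality, data = df, FUN = mean)
avg
```

On the real data the same call would give one row per quality score, which can then be drawn as a line or bar chart.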

When I add quality_grouped into the alcohol-residual.sugar plot, I observe that high-quality wines (dark blue points) generally have a high alcohol level.

An important implication of the graph is that high quality wines generally have a low residual.sugar level (less than 5). However, a low level of sugar does not mean a high quality wine: there are many low quality wines that also have low residual.sugar.

I will change alcohol to bins to be able to plot density, alcohol, residual sugar and quality together and see how they relate to each other:

Quality, density, residual.sugar, alcohol
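Binning a continuous variable like alcohol can be done with cut(). The bin edges below are an assumption, as the report does not say which ones were used, and the values are a short made-up vector within the observed range of 8 to 14.2.

```r
# Made-up alcohol values within the observed range.
alcohol <- c(8.0, 8.8, 9.5, 10.1, 10.4, 11.0, 11.4, 12.8, 14.2)

# Assumed bin edges: [8,10), [10,12), [12,15].
alcohol.bin <- cut(alcohol,
                   breaks = c(8, 10, 12, 15),
                   include.lowest = TRUE,
                   right = FALSE)

table(alcohol.bin)
```

The resulting factor can be used to color or facet a scatterplot, so that a fourth variable fits into a two-dimensional chart.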

We observe here a moderate negative relationship between alcohol and total sulfur dioxide, faceted by quality, especially for the wines that are rated above 5.

fixed.acidity and pH, quality.cat

This plot shows pH for each quality in relationship with fixed acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the

There is a very interesting relation between density, alcohol, residual sugar and quality: as quality increases, alcohol increases while density and residual sugar decrease.

These variables were among the most important ones.

Were there any interesting or surprising interactions between features?

Higher rated wines have a lower total sulfur dioxide, which is consistent with sulfur dioxide being mostly undetectable at low concentrations.

Higher rated wines have a lower amount of sugar than the other rated categories.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

A very interesting relation is shown in this chart. Given a value of residual sugar, density increases as alcohol decreases. This is to some extent due to the fermentation process of winemaking, in which sugar is consumed to generate alcohol. Since alcohol is less dense than water and sugar is more dense than water, this process makes the density of the wine decrease.

Plot Two

Description Two

The graph shows that there is a negative relationship between volatile acidity and quality. We can also observe that high quality wines have a high alcohol level. Furthermore, the separation in alcohol levels becomes larger at high volatile acidity.

Plot Three

Description Three

A negative relationship between alcohol and residual sugar is detected. Although the variance is quite high, the smoothing curve shows the average residual sugar by alcohol. It is interesting to see that residual.sugar decreases significantly as alcohol increases.


Reflection

In this exploratory data analysis I went through 3 steps: 1 - univariate, 2 - bivariate and 3 - finally multivariate analysis.

I examined the variables and the relationships between them. Some interesting relations came up, like the one between alcohol, density, residual sugar and quality, that could be related to the fermentation process of wine. The correlation between pH and fixed acidity, while pH does not correlate with volatile acidity or citric acid, is also worth noting.

The challenges I encountered came from the fact that the variables were not clearly explained, as they represent chemical properties.

I think it would be interesting to have a more evenly classed dataset, with more low and high quality wines, to better visualize these trends. In addition, I think it would be interesting to see the price of the wine too.

Also, the analysis I did depended on relationships between correlating variables, but there are surely non-correlating factors that still need more investigation.